Data processing system and method for predictively selecting a scope of broadcast of an operation utilizing a history-based prediction

ABSTRACT

According to a method of data processing, a predictor is maintained that indicates a historical scope of broadcast for one or more previous operations transmitted on an interconnect of a data processing system. A scope of broadcast of a subsequent operation is predictively selected by reference to the predictor.

CROSS-REFERENCE TO RELATED APPLICATION(S)

The present application is a continuation of U.S. patent application Ser. No. 11/140,821, filed on May 31, 2005, and entitled “Data Processing System and Method for Predictively Selecting a Scope of Broadcast of an Operation Utilizing a History-Based Prediction,” which is also related to U.S. patent application Ser. Nos. 11/054,886 and 11/055,697, which are assigned to the assignee of the present invention and incorporated herein by reference in their entireties.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates in general to data processing and, in particular, to data processing in a cache coherent data processing system.

2. Description of the Related Art

A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.

Because multiple processor cores may request write access to a same cache line of data and because modified cache lines are not immediately synchronized with system memory, the cache hierarchies of multiprocessor computer systems typically implement a cache coherency protocol to ensure at least a minimum level of coherence among the various processor cores' “views” of the contents of system memory. In particular, cache coherency requires, at a minimum, that after a processing unit accesses a copy of a memory block and subsequently accesses an updated copy of the memory block, the processing unit cannot again access the old copy of the memory block.

A cache coherency protocol typically defines a set of cache states stored in association with the cache lines of each cache hierarchy, as well as a set of coherency messages utilized to communicate the cache state information between cache hierarchies. In a typical implementation, the cache state information takes the form of the well-known MESI (Modified, Exclusive, Shared, Invalid) protocol or a variant thereof, and the coherency messages indicate a protocol-defined coherency state transition in the cache hierarchy of the requestor and/or the recipients of a memory access request.

Conventional cache coherency protocols have generally assumed that to maintain cache coherency a global broadcast of coherency messages had to be employed. That is, all coherency messages must be received by all cache hierarchies in an SMP computer system. The present invention recognizes, however, that the requirement of global broadcast of coherency messages creates a significant impediment to the scalability of SMP computer systems and, in particular, consumes an increasing amount of the bandwidth of the system interconnect as systems scale.

SUMMARY OF THE INVENTION

In view of the foregoing, the present invention provides an improved cache coherent data processing system, cache system and method of data processing in a cache coherent data processing system.

In one embodiment, a predictor is maintained that indicates a historical scope of broadcast for one or more previous operations transmitted on an interconnect of a data processing system. A scope of broadcast of a subsequent operation is predictively selected by reference to the predictor.

All objects, features, and advantages of the present invention will become apparent in the following detailed written description.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. However, the invention, as well as a preferred mode of use, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is a high level block diagram of an exemplary data processing system in accordance with the present invention;

FIG. 2 is a more detailed block diagram of a processing unit in accordance with the present invention;

FIG. 3 is a more detailed block diagram of the L2 cache array and directory depicted in FIG. 2;

FIG. 4 is a time-space diagram of an exemplary transaction on the system interconnect of the data processing system of FIG. 1;

FIG. 5 illustrates a domain indicator in accordance with a preferred embodiment of the present invention;

FIG. 6 is a high level logical flowchart of an exemplary method by which a cache memory services an operation received from a processor core in a data processing system in accordance with the present invention;

FIG. 7 is a more detailed block diagram of one embodiment of the scope prediction logic depicted in FIG. 2;

FIG. 8 is a high level logical flowchart of an exemplary process of scope prediction in accordance with the present invention; and

FIG. 9 is a more detailed logical flowchart of an exemplary process of history-based scope prediction in accordance with the present invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT

I. Exemplary Data Processing System

With reference now to the figures and, in particular, with reference to FIG. 1, there is illustrated a high level block diagram of an exemplary embodiment of a cache coherent symmetric multiprocessor (SMP) data processing system in accordance with the present invention. As shown, data processing system 100 includes multiple processing nodes 102a, 102b for processing data and instructions. Processing nodes 102a, 102b are coupled to a system interconnect 110 for conveying address, data and control information. System interconnect 110 may be implemented, for example, as a bused interconnect, a switched interconnect or a hybrid interconnect.

In the depicted embodiment, each processing node 102 is realized as a multi-chip module (MCM) containing four processing units 104a-104d, each preferably realized as a respective integrated circuit. The processing units 104a-104d within each processing node 102 are coupled for communication by a local interconnect 114, which, like system interconnect 110, may be implemented with one or more buses and/or switches.

The devices coupled to each local interconnect 114 include not only processing units 104, but also one or more system memories 108a-108d. Data and instructions residing in system memories 108 can generally be accessed and modified by a processor core in any processing unit 104 in any processing node 102 of data processing system 100. In alternative embodiments of the invention, one or more system memories 108 can be coupled to system interconnect 110 rather than a local interconnect 114.

Those skilled in the art will appreciate that SMP data processing system 100 can include many additional unillustrated components, such as interconnect bridges, non-volatile storage, ports for connection to networks or attached devices, etc. Because such additional components are not necessary for an understanding of the present invention, they are not illustrated in FIG. 1 or discussed further herein. It should also be understood, however, that the enhancements provided by the present invention are applicable to cache coherent data processing systems of diverse architectures and are in no way limited to the generalized data processing system architecture illustrated in FIG. 1.

Referring now to FIG. 2, there is depicted a more detailed block diagram of an exemplary processing unit 104 in accordance with the present invention. In the depicted embodiment, each processing unit 104 includes two processor cores 200a, 200b for independently processing instructions and data. Each processor core 200 includes at least an instruction sequencing unit (ISU) 208 for fetching and ordering instructions for execution and one or more execution units 224 for executing instructions. Execution units 224 preferably include a load-store unit (LSU) 228 for executing memory access instructions that reference a memory block or cause the generation of an operation referencing a memory block. In a preferred embodiment, each processor core 200 is capable of simultaneously executing instructions within two or more hardware threads of execution.

The operation of each processor core 200 is supported by a multi-level volatile memory hierarchy having at its lowest level shared system memories 108a-108d, and at its upper levels one or more levels of cache memory. In the depicted embodiment, each processing unit 104 includes an integrated memory controller (IMC) 206 that controls read and write access to a respective one of the system memories 108a-108d within its processing node 102 in response to requests received from processor cores 200a-200b and operations snooped by a snooper (S) 222 on the local interconnect 114. IMC 206 determines the addresses for which it is responsible by reference to base address register (BAR) logic 240.

In the illustrative embodiment, the cache memory hierarchy of processing unit 104 includes a store-through level one (L1) cache 226 (which may be bifurcated into separate L1 instruction and data caches) within each processor core 200 and a level two (L2) cache 230 shared by all processor cores 200a, 200b of the processing unit 104. L2 cache 230 includes an L2 array and directory 234 and a cache controller comprising a master 232 and a snooper 236. Master 232 initiates transactions on local interconnect 114 and system interconnect 110 and accesses L2 array and directory 234 in response to memory access (and other) requests received from the associated processor cores 200a-200b. Master 232 includes BAR register 252, which indicates which addresses reside in the system memories 108 in its processing node 102, and scope prediction logic 250, which, as described further below, may be utilized to predict the scope of operations transmitted on the interconnect fabric including local interconnects 114 and system interconnect 110. Snooper 236 snoops operations on local interconnect 114, provides appropriate responses, and performs any accesses to L2 array and directory 234 required by the operations.

Although the illustrated cache hierarchy includes only two levels of cache, those skilled in the art will appreciate that alternative embodiments may include additional levels (L3, L4, L5, etc.) of on-chip or off-chip in-line or lookaside cache, which may be fully inclusive, partially inclusive, or non-inclusive of the contents of the upper levels of cache.

Each processing unit 104 further includes an instance of response logic 210, which implements a portion of the distributed coherency signaling mechanism that maintains cache coherency within data processing system 100. In addition, each processing unit 104 includes an instance of interconnect logic 212 for selectively forwarding communications between its local interconnect 114 and system interconnect 110. Finally, each processing unit 104 includes an integrated I/O (input/output) controller 214 supporting the attachment of one or more I/O devices, such as I/O device 216. I/O controller 214 may issue operations on local interconnect 114 and/or system interconnect 110 in response to requests by I/O device 216.

With reference now to FIG. 3, there is illustrated a more detailed block diagram of an exemplary embodiment of L2 array and directory 234. As illustrated, L2 array and directory 234 includes a set associative L2 cache array 300 and an L2 cache directory 302 of the contents of L2 cache array 300. As in conventional set associative caches, memory locations in system memories 108 are mapped to particular congruence classes within cache array 300 utilizing predetermined index bits within the system memory (real) addresses. The particular cache lines stored within cache array 300 are recorded in cache directory 302, which contains one directory entry for each cache line in cache array 300. As understood by those skilled in the art, each directory entry in cache directory 302 comprises at least a tag field 304, which specifies the particular cache line stored in cache array 300 utilizing a tag portion of the corresponding real address, a state field 306, which indicates the coherency state of the cache line, and an LRU (Least Recently Used) field 308 indicating a replacement order for the cache line with respect to other cache lines in the same congruence class.
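For purposes of illustration only, the directory organization just described can be modeled with the following C sketch. The geometry (128-byte lines, 512 congruence classes, 8-way set associativity) and all identifiers (e.g., l2_dir_entry_t, dir_lookup) are assumptions chosen for the example and are not drawn from the figures.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative geometry (not specified in the text). */
#define LINE_BYTES     128
#define NUM_CLASSES    512
#define WAYS_PER_CLASS 8

/* A subset of coherency states, used here only to mark invalid entries. */
typedef enum { STATE_I, STATE_S, STATE_SR, STATE_T, STATE_M } l2_state_t;

/* One directory entry per cache line (FIG. 3): tag, state and LRU fields. */
typedef struct {
    uint64_t   tag;    /* tag portion of the real address                  */
    l2_state_t state;  /* coherency state of the cached line               */
    uint8_t    lru;    /* replacement order within the congruence class    */
} l2_dir_entry_t;

typedef struct {
    l2_dir_entry_t entry[NUM_CLASSES][WAYS_PER_CLASS];
} l2_directory_t;

/* Predetermined index bits of the real address select a congruence class. */
static inline unsigned congruence_class(uint64_t real_addr)
{
    return (unsigned)((real_addr / LINE_BYTES) % NUM_CLASSES);
}

/* Tag = real address with the offset and index bits stripped off. */
static inline uint64_t tag_of(uint64_t real_addr)
{
    return real_addr / ((uint64_t)LINE_BYTES * NUM_CLASSES);
}

/* Directory lookup: compare tags within the selected congruence class. */
static l2_dir_entry_t *dir_lookup(l2_directory_t *dir, uint64_t real_addr)
{
    unsigned cc = congruence_class(real_addr);
    for (int w = 0; w < WAYS_PER_CLASS; w++) {
        l2_dir_entry_t *e = &dir->entry[cc][w];
        if (e->state != STATE_I && e->tag == tag_of(real_addr))
            return e;       /* directory hit */
    }
    return NULL;            /* directory miss */
}
```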

II. Exemplary Operation

Referring now to FIG. 4, there is depicted a time-space diagram of an exemplary operation on a local or system interconnect 110, 114 of data processing system 100 of FIG. 1. Although interconnects 110, 114 are not necessarily bused interconnects, operations transmitted on one or more local interconnects 114 and/or system interconnect 110 are referred to herein as “bus operations” to distinguish them from CPU requests transmitted between processor cores 200 and the cache memories residing within their own cache hierarchies.

The illustrated bus operation begins when a master 232 of an L2 cache 230 (or another master, such as an I/O controller 214) issues a request 402 on a local interconnect 114 and/or system interconnect 110. Request 402 preferably includes a transaction type indicating a type of desired access and a resource identifier (e.g., real address) indicating a resource to be accessed by the request. Common types of requests preferably include those set forth below in Table I.

TABLE I

Request | Description
READ | Requests a copy of the image of a memory block for query purposes
RWITM (Read-With-Intent-To-Modify) | Requests a unique copy of the image of a memory block with the intent to update (modify) it and requires destruction of other copies, if any
DCLAIM (Data Claim) | Requests authority to promote an existing query-only copy of a memory block to a unique copy with the intent to update (modify) it and requires destruction of other copies, if any
DCBZ (Data Cache Block Zero) | Requests authority to create a new unique cached copy of a memory block without regard to its present state and subsequently modify its contents; requires destruction of other copies, if any
CASTOUT | Copies the image of a memory block from a higher level of memory to a lower level of memory in preparation for the destruction of the higher level copy
WRITE | Requests authority to create a new unique copy of a memory block without regard to its present state and immediately copy the image of the memory block from a higher level memory to a lower level memory in preparation for the destruction of the higher level copy
PARTIAL WRITE | Requests authority to create a new unique copy of a partial memory block without regard to its present state and immediately copy the image of the partial memory block from a higher level memory to a lower level memory in preparation for the destruction of the higher level copy
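As a point of reference for the discussion that follows, the request types of Table I and the transaction type plus resource identifier carried by a request 402 can be captured in a brief C sketch; the identifiers (ttype_t, bus_request_t) are illustrative only.

```c
#include <stdint.h>

/* Bus-operation transaction types from Table I. */
typedef enum {
    TTYPE_READ,          /* query-only copy of a memory block                  */
    TTYPE_RWITM,         /* unique copy with intent to modify                  */
    TTYPE_DCLAIM,        /* promote an existing query-only copy to unique      */
    TTYPE_DCBZ,          /* create and zero a new unique cached copy           */
    TTYPE_CASTOUT,       /* copy a block to lower-level memory before eviction */
    TTYPE_WRITE,         /* new unique copy, immediately written back          */
    TTYPE_PARTIAL_WRITE  /* as WRITE, but for a partial memory block           */
} ttype_t;

/* A request 402 carries at least a transaction type and a resource
 * identifier (e.g., the real address of the target memory block). */
typedef struct {
    ttype_t  ttype;
    uint64_t real_addr;
} bus_request_t;
```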

Request 402 is received by the snoopers 236 of L2 caches 230, as well as the snoopers 222 of memory controllers 206 (FIG. 1). In general, with some exceptions, the snooper 236 in the same L2 cache 230 as the master 232 of request 402 does not snoop request 402 (i.e., there is generally no self-snooping) because a request 402 is transmitted on local interconnect 114 and/or system interconnect 110 only if the request 402 cannot be serviced internally by a processing unit 104. Each snooper 222, 236 that receives request 402 may provide a respective partial response 406 representing the response of at least that snooper to request 402. A snooper 222 within a memory controller 206 determines the partial response 406 to provide based, for example, upon whether the snooper 222 is responsible for the request address and whether it has resources available to service the request. A snooper 236 of an L2 cache 230 may determine its partial response 406 based on, for example, the availability of its L2 cache directory 302, the availability of a snoop logic instance within snooper 236 to handle the request, and the coherency state associated with the request address in L2 cache directory 302.

The partial responses of snoopers 222 and 236 are logically combined either in stages or all at once by one or more instances of response logic 210 to determine a system-wide combined response (CR) 410 to request 402. Subject to the scope restrictions discussed below, response logic 210 provides combined response 410 to master 232 and snoopers 222, 236 via its local interconnect 114 and/or system interconnect 110 to indicate the system-wide response (e.g., success, failure, retry, etc.) to request 402. If CR 410 indicates success of request 402, CR 410 may indicate, for example, a data source for a requested memory block, a cache state in which the requested memory block is to be cached by master 232, and whether “cleanup” operations invalidating the requested memory block in one or more L2 caches 230 are required.

In response to receipt of combined response 410, one or more of master 232 and snoopers 222, 236 typically perform one or more operations in order to service request 402. These operations may include supplying data to master 232, invalidating or otherwise updating the coherency state of data cached in one or more L2 caches 230, performing castout operations, writing back data to a system memory 108, etc. If required by request 402, a requested or target memory block may be transmitted to or from master 232 before or after the generation of combined response 410 by response logic 210.

In the following description, the partial response of a snooper 222, 236 to a request and the operations performed by the snooper in response to the request and/or its combined response will be described with reference to whether that snooper is a Highest Point of Coherency (HPC), a Lowest Point of Coherency (LPC), or neither with respect to the request address specified by the request. An LPC is defined herein as a memory device or I/O device that serves as the repository for a memory block. In the absence of an HPC for the memory block, the LPC holds the true image of the memory block and has authority to grant or deny requests to generate an additional cached copy of the memory block. For a typical request in the data processing system embodiment of FIGS. 1 and 2, the LPC will be the memory controller 206 for the system memory 108 holding the referenced memory block. An HPC is defined herein as a uniquely identified device that caches a true image of the memory block (which may or may not be consistent with the corresponding memory block at the LPC) and has the authority to grant or deny a request to modify the memory block. The HPC may also provide a copy of the memory block to a requester in response to an operation that does not modify the memory block. Thus, for a typical request in the data processing system embodiment of FIGS. 1 and 2, the HPC, if any, will be an L2 cache 230. Although other indicators may be utilized to designate an HPC for a memory block, a preferred embodiment of the present invention designates the HPC, if any, for a memory block utilizing selected cache coherency state(s) within the L2 cache directory 302 of an L2 cache 230, as described further below with reference to Table II.

Still referring to FIG. 4, the HPC, if any, for a memory block referenced in a request 402, or in the absence of an HPC, the LPC of the memory block, preferably has the responsibility of protecting the transfer of ownership of a memory block in response to a request 402 during a protection window 404a. In the exemplary scenario shown in FIG. 4, the snooper 236 that is the HPC for the memory block specified by the request address of request 402 protects the transfer of ownership of the requested memory block to master 232 during a protection window 404a that extends from the time that snooper 236 determines its partial response 406 until snooper 236 receives combined response 410. During protection window 404a, snooper 236 protects the transfer of ownership by providing partial responses 406 to other requests specifying the same request address that prevent other masters from obtaining ownership until ownership has been successfully transferred to master 232. Master 232 likewise initiates a protection window 404b to protect its ownership of the memory block requested in request 402 following receipt of combined response 410.

Because snoopers 222, 236 all have limited resources for handling the CPU and I/O requests described above, several different levels of partial responses and corresponding CRs are possible. For example, if a snooper 222 within a memory controller 206 that is responsible for a requested memory block has a queue available to handle a request, the snooper 222 may respond with a partial response indicating that it is able to serve as the LPC for the request. If, on the other hand, the snooper 222 has no queue available to handle the request, the snooper 222 may respond with a partial response indicating that it is the LPC for the memory block, but is unable to currently service the request.

Similarly, a snooper 236 in an L2 cache 230 may require an available instance of snoop logic and access to L2 cache directory 302 in order to handle a request. Absence of access to either (or both) of these resources results in a partial response (and corresponding CR) signaling a present inability to service the request due to absence of a required resource.

Hereafter, a snooper 222, 236 providing a partial response indicating that the snooper has available all internal resources required to presently service a request, if required, is said to “affirm” the request. For snoopers 236, partial responses affirming a snooped operation preferably indicate the cache state of the requested or target memory block at that snooper 236. A snooper 222, 236 providing a partial response indicating that the snooper 236 does not have available all internal resources required to presently service the request may be said to be “possibly hidden” or “unable” to service the request. Such a snooper 236 is “possibly hidden” or “unable” to service a request because the snooper 236, due to lack of an available instance of snoop logic or present access to L2 cache directory 302, cannot “affirm” the request in the sense defined above and has, from the perspective of other masters 232 and snoopers 222, 236, an unknown coherency state.

III. Data Delivery Domains

Conventional broadcast-based data processing systems handle both cache coherency and data delivery through broadcast communication, which in conventional systems is transmitted on a system interconnect to at least all memory controllers and cache hierarchies in the system. As compared with systems of alternative architectures and like scale, broadcast-based systems tend to offer decreased access latency and better data handling and coherency management of shared memory blocks.

As broadcast-based systems scale in size, traffic volume on the system interconnect is multiplied, meaning that system cost rises sharply with system scale as more bandwidth is required for communication over the system interconnect. That is, a system with m processor cores, each having an average traffic volume of n transactions, has a traffic volume of m×n, meaning that traffic volume in broadcast-based systems scales multiplicatively, not additively. Beyond the requirement for substantially greater interconnect bandwidth, an increase in system size has the secondary effect of increasing some access latencies. For example, the access latency of read data is limited, in the worst case, by the combined response latency of the furthest away lower level cache holding the requested memory block in a shared coherency state from which the requested data can be sourced.
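The multiplicative scaling noted above can be stated compactly; the doubling example below is purely illustrative and not a measurement of any particular system.

```latex
% Total broadcast traffic for m processor cores, each issuing an
% average of n transactions:
\[
  V_{\mathrm{total}} = m \times n .
\]
% Doubling the number of cores while n is unchanged doubles the
% traffic that every snooper must examine:
\[
  V' = (2m)\,n = 2\,V_{\mathrm{total}} .
\]
```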

In order to reduce system interconnect bandwidth requirements and access latencies while still retaining the advantages of a broadcast-based system, multiple L2 caches 230 distributed throughout data processing system 100 are permitted to hold copies of the same memory block in a “special” shared coherency state that permits these caches to supply the memory block to requesting L2 caches 230 using cache-to-cache intervention. In order to implement multiple concurrent and distributed sources for shared memory blocks in an SMP data processing system, such as data processing system 100, two issues must be addressed. First, some rule governing the creation of copies of memory blocks in the “special” shared coherency state alluded to above must be implemented. Second, there must be a rule governing which snooping L2 cache 230, if any, provides a shared memory block to a requesting L2 cache 230, for example, in response to a bus read operation or bus RWITM operation.

According to the present invention, both of these issues are addressed through the implementation of data sourcing domains. In particular, each domain within an SMP data processing system, where a domain is defined to include one or more lower level (e.g., L2) caches that participate in responding to data requests, is permitted to include only one cache hierarchy that holds a particular memory block in the “special” shared coherency state at a time. That cache hierarchy, if present when a bus read-type (e.g., read or RWITM) operation is initiated by a requesting lower level cache in the same domain, is responsible for sourcing the requested memory block to the requesting lower level cache. Although many different domain sizes may be defined, in data processing system 100 of FIG. 1, it is convenient if each processing node 102 (i.e., MCM) is considered a data sourcing domain. One example of such a “special” shared state (i.e., Sr) is described below with reference to Table II.

IV. Coherency Domains

While the implementation of data delivery domains as described above improves data access latency, this enhancement does not address the m×n multiplication of traffic volume as system scale increases. In order to reduce traffic volume while still maintaining a broadcast-based coherency mechanism, preferred embodiments of the present invention additionally implement coherency domains, which, like the data delivery domains hereinbefore described, can conveniently be (but are not required to be) implemented with each processing node 102 forming a separate coherency domain. Data delivery domains and coherency domains can be, but are not required to be, coextensive, and for the purposes of explaining exemplary operation of data processing system 100 will hereafter be assumed to have boundaries defined by processing nodes 102.

The implementation of coherency domains reduces system traffic by limiting inter-domain broadcast communication over system interconnect 110 in cases in which requests can be serviced with participation by fewer than all coherency domains. For example, if processing unit 104a of processing node 102a has a bus read operation to issue, then processing unit 104a may elect to first broadcast the bus read operation to all participants within its own coherency domain (e.g., processing node 102a), but not to participants in other coherency domains (e.g., processing node 102b). A broadcast operation transmitted to only those participants within the same coherency domain as the master of the operation is defined herein as a “local operation”. If the local bus read operation can be serviced within the coherency domain of processing unit 104a, then no further broadcast of the bus read operation is performed. If, however, the partial responses and combined response to the local bus read operation indicate that the bus read operation cannot be serviced solely within the coherency domain of processing node 102a, the scope of the broadcast may then be extended to include, in addition to the local coherency domain, one or more additional coherency domains.

In a basic implementation, two broadcast scopes are employed: a “local” scope including only the local coherency domain and a “global” scope including all of the other coherency domains in the SMP data processing system. Thus, an operation that is transmitted to all coherency domains in an SMP data processing system is defined herein as a “global operation”. Importantly, regardless of whether local operations or operations of more expansive scope (e.g., global operations) are employed to service operations, cache coherency is maintained across all coherency domains in the SMP data processing system. Examples of local and global operations are described in detail in U.S. patent application Ser. No. 11/055,697, which is incorporated herein by reference in its entirety.

In a preferred embodiment, the scope of an operation is indicated in a bus operation by a local/global scope indicator (signal), which in one embodiment may comprise a 1-bit flag. Forwarding logic 212 within processing units 104 preferably determines whether or not to forward an operation received via local interconnect 114 onto system interconnect 110 based upon the setting of the local/global scope indicator (signal) in the operation.
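A minimal C sketch of this forwarding decision is given below; the structure and function names are illustrative assumptions rather than the actual implementation of forwarding logic 212.

```c
#include <stdbool.h>
#include <stdint.h>

/* One-bit local/global scope indicator carried with each bus operation. */
typedef enum { SCOPE_GLOBAL = 0, SCOPE_LOCAL = 1 } scope_t;

typedef struct {
    scope_t  scope;     /* local/global scope indicator (signal) */
    uint64_t real_addr; /* resource identifier                   */
} bus_op_t;

/* An operation snooped on local interconnect 114 is forwarded onto
 * system interconnect 110 only if its scope indicator marks it global. */
static bool forward_to_system_interconnect(const bus_op_t *op)
{
    return op->scope == SCOPE_GLOBAL;
}
```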

V. Domain Indicators

In order to limit the issuance of unneeded local operations and thereby reduce operational latency and conserve additional bandwidth on local interconnects, the present invention preferably implements a domain indicator per memory block that indicates whether or not a copy of the associated memory block is cached outside of the local coherency domain. For example, FIG. 5 depicts a first exemplary implementation of a domain indicator in accordance with the present invention. As shown in FIG. 5, a system memory 108, which may be implemented in dynamic random access memory (DRAM), stores a plurality of memory blocks 500. System memory 108 stores in association with each memory block 500 an associated error correcting code (ECC) 502 utilized to correct errors, if any, in memory block 500 and a domain indicator 504. Although in some embodiments of the present invention, domain indicator 504 may identify a particular coherency domain (i.e., specify a coherency domain or node ID), it is hereafter assumed that domain indicator 504 is a 1-bit indicator that is set (e.g., to ‘1’ to indicate “local”) if the associated memory block 500 is cached, if at all, only within the same coherency domain as the memory controller 206 serving as the LPC for the memory block 500. Domain indicator 504 is reset (e.g., to ‘0’ to indicate “global”) otherwise. The setting of domain indicators 504 to indicate “local” may be implemented imprecisely in that a false setting of “global” will not induce any coherency errors, but may cause unneeded global broadcasts of operations.

Memory controllers 206 (and L2 caches 230) that source a memory block in response to an operation preferably transmit the associated domain indicator 504 in conjunction with the requested memory block.

VI. Exemplary Coherency Protocol

The present invention preferably implements a cache coherency protocol designed to leverage the implementation of data delivery and coherency domains as described above. In a preferred embodiment, the cache coherency states within the protocol, in addition to providing (1) an indication of whether a cache is the HPC for a memory block, also indicate (2) whether the cached copy is unique (i.e., is the only cached copy system-wide) among caches at that memory hierarchy level, (3) whether and when the cache can provide a copy of the memory block to a master of a request for the memory block, (4) whether the cached image of the memory block is consistent with the corresponding memory block at the LPC (system memory), and (5) whether another cache in a remote coherency domain (possibly) holds a cache entry having a matching address. These five attributes can be expressed, for example, in an exemplary variant of the well-known MESI (Modified, Exclusive, Shared, Invalid) protocol summarized below in Table II.

TABLE II

Cache state | HPC? | Unique? | Data source? | Consistent with LPC? | Cached outside local domain? | Legal concurrent states
M | yes | yes | yes, before CR | no | no | I, Ig, In (& LPC)
Me | yes | yes | yes, before CR | yes | no | I, Ig, In (& LPC)
T | yes | unknown | yes, after CR if none provided before CR | no | unknown | Sr, S, I, Ig, In (& LPC)
Tn | yes | unknown | yes, after CR if none provided before CR | no | no | Sr, S, I, Ig, In (& LPC)
Te | yes | unknown | yes, after CR if none provided before CR | yes | unknown | Sr, S, I, Ig, In (& LPC)
Ten | yes | unknown | yes, after CR if none provided before CR | yes | no | Sr, S, I, Ig, In (& LPC)
Sr | no | unknown | yes, before CR | unknown | unknown | T, Tn, Te, Ten, S, I, Ig, In (& LPC)
S | no | unknown | no | unknown | unknown | T, Tn, Te, Ten, Sr, S, I, Ig, In (& LPC)
I | no | n/a | no | n/a | unknown | M, Me, T, Tn, Te, Ten, Sr, S, I, Ig, In (& LPC)
Ig | no | n/a | no | n/a | Assumed so, in absence of other information | M, Me, T, Tn, Te, Ten, Sr, S, I, Ig, In (& LPC)
In | no | n/a | no | n/a | Assumed not, in absence of other information | M, Me, T, Tn, Te, Ten, Sr, S, I, Ig, In (& LPC)

A. Ig State

In order to avoid having to access the LPC to determine whether or not the memory block is known to be cached, if at all, only locally, the Ig (Invalid global) coherency state is utilized to maintain a domain indication in cases in which no copy of a memory block remains cached in a coherency domain. The Ig state is defined herein as a cache coherency state indicating (1) the associated memory block in the cache array is invalid, (2) the address tag in the cache directory is valid, and (3) a copy of the memory block identified by the address tag may possibly be cached in another coherency domain. The Ig indication is preferably imprecise, meaning that it may be incorrect without a violation of coherency.

The Ig state is formed in a lower level cache in response to that cache providing a requested memory block to a requester in another coherency domain in response to an exclusive access request (e.g., a bus RWITM operation). In some embodiments of the present invention, it may be preferable to form the Ig state only in the coherency domain containing the LPC for the memory block. In such embodiments, some mechanism (e.g., a partial response by the LPC and subsequent combined response) must be implemented to indicate to the cache sourcing the requested memory block that the LPC is within its local coherency domain. In other embodiments that do not support the communication of an indication that the LPC is local, an Ig state may be formed any time that a cache sources a memory block to a remote coherency domain in response to an exclusive access request.

Because cache directory entries including an Ig state carry potentially useful information, it is desirable in at least some implementations to preferentially retain entries in the Ig state over entries in the I state (e.g., by modifying the Least Recently Used (LRU) algorithm utilized to select a victim cache entry for replacement). As Ig directory entries are retained in cache, it is possible for some Ig entries to become “stale” over time in that a cache whose exclusive access request caused the formation of the Ig state may deallocate or writeback its copy of the memory block without notification to the cache holding the address tag of the memory block in the Ig state. In such cases, the “stale” Ig state, which incorrectly indicates that a global operation should be issued instead of a local operation, will not cause any coherency errors, but will merely cause some operations, which could otherwise be serviced utilizing a local operation, to be issued as global operations. Occurrences of such inefficiencies will be limited in duration by the eventual replacement of the “stale” Ig cache entries and by domain indication scrubbing, as described further below.

Several rules govern the selection and replacement of Ig cache entries. First, if a cache selects an Ig entry as the victim for replacement, a castout of the Ig entry is performed (unlike the case when an I entry is selected). Second, if a request that causes a memory block to be loaded into a cache hits on an Ig cache entry in that same cache, the cache treats the Ig hit as a cache miss and performs a castout operation with the Ig entry as the selected victim. The cache thus avoids placing two copies of the same address tag in the cache directory. Third, the castout of the Ig state is preferably performed as a local operation, or, if performed as a global operation, ignored by memory controllers of non-local coherency domains. If an Ig entry is permitted to form in a cache that is not within the same coherency domain as the LPC for the memory block, no update to the domain indicator in the LPC is required. Fourth, the castout of the Ig state is preferably performed as a dataless address-only operation in which the domain indicator is written back to the LPC (if local to the cache performing the castout).

Implementation of an Ig state in accordance with the present invention improves communication efficiency by maintaining a cached domain indicator for a memory block in a coherency domain even when no valid copy of the memory block remains cached in the coherency domain. As a consequence, an HPC for a memory block can service an exclusive access request (e.g., bus RWITM operation) from a remote coherency domain without retrying the request and performing a push of the requested memory block to the LPC.

B. In State

The In state is defined herein as a cache coherency state indicating (1) the associated memory block in the cache array is invalid, (2) the address tag in the cache directory is valid, and (3) a copy of the memory block identified by the address tag is likely cached, if at all, only by one or more other cache hierarchies within the local coherency domain. The In indication is preferably imprecise, meaning that it may be incorrect without a violation of coherency. The In state is formed in a lower level cache in response to that cache providing a requested memory block to a requestor in the same coherency domain in response to an exclusive access request (e.g., a bus RWITM operation).

Because cache directory entries including an In state carry potentially useful information, it is desirable in at least some implementations to preferentially retain entries in the In state over entries in the I state (e.g., by modifying the Least Recently Used (LRU) algorithm utilized to select a victim cache entry for replacement). As In directory entries are retained in cache, it is possible for some In entries to become “stale” over time in that a cache whose exclusive access request caused the formation of the In state may itself supply a shared copy of the memory block to a remote coherency domain without notification to the cache holding the address tag of the memory block in the In state. In such cases, the “stale” In state, which incorrectly indicates that a local operation should be issued instead of a global operation, will not cause any coherency errors, but will merely cause some operations to be erroneously first issued as local operations, rather than as global operations. Occurrences of such inefficiencies will be limited in duration by the eventual replacement of the “stale” In cache entries. In a preferred embodiment, cache entries in the In coherency state are not subject to castout, but are instead simply replaced. Thus, unlike Ig cache entries, In cache entries are not utilized to update domain indicators 504 in system memories 108.

Implementation of an In state in accordance with the present invention improves communication efficiency by maintaining a cached domain indicator for a memory block that may be consulted by a master in order to select a local scope for one of its operations. As a consequence, bandwidth on system interconnect 110 and local interconnects 114 in other coherency domains is conserved.

C. Sr State

In the operations described below, it is useful to be able to determine whether or not a lower level cache holding a shared requested memory block in the Sr coherency state is located within the same domain as the requesting master. In one embodiment, the presence of a “local” Sr snooper within the same domain as the requesting master can be indicated by the response behavior of a snooper at a lower level cache holding a requested memory block in the Sr coherency state. For example, assuming that each bus operation includes a range indicator indicating whether the bus operation has crossed a domain boundary (e.g., an explicit domain identifier of the master or a single local/not local range bit), a lower level cache holding a shared memory block in the Sr coherency state can provide a partial response affirming the request in the Sr state only for requests by masters within the same data sourcing domain and provide partial responses indicating the S state for all other requests. In such embodiments the response behavior can be summarized as shown in Table III, where prime (′) notation is utilized to designate partial responses that may differ from the actual cache state of the memory block.

TABLE III

Domain of master of read-type request | Cache state in directory | Partial response (adequate resources available) | Partial response (adequate resources unavailable)
“local” (i.e., within same domain) | Sr | Sr′ affirm | Sr′ possibly hidden
“remote” (i.e., not within same domain) | Sr | S′ affirm | S′ possibly hidden
“local” (i.e., within same domain) | S | S′ affirm | S′ possibly hidden
“remote” (i.e., not within same domain) | S | S′ affirm | S′ possibly hidden

Assuming the response behavior set forth above in Table III, the average data latency for shared data can be significantly decreased by increasing the number of shared copies of memory blocks distributed within an SMP data processing system that may serve as data sources.
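The response behavior of Table III can also be summarized in the following C sketch, in which the enumerators and the function sr_partial_response are illustrative names only, not elements of the figures.

```c
#include <stdbool.h>

/* Cache state of the snooped line at the lower level cache. */
typedef enum { CSTATE_SR, CSTATE_S } shared_state_t;

/* Partial responses from Table III (prime notation: the reported state
 * may differ from the state actually recorded in the directory). */
typedef enum {
    PR_SR_AFFIRM,          /* Sr' affirm          */
    PR_S_AFFIRM,           /* S'  affirm          */
    PR_SR_POSSIBLY_HIDDEN, /* Sr' possibly hidden */
    PR_S_POSSIBLY_HIDDEN   /* S'  possibly hidden */
} partial_response_t;

/* A snooper holding a shared copy affirms in the Sr state only for
 * read-type requests from masters within the same data sourcing domain;
 * otherwise it reports the S state. */
static partial_response_t sr_partial_response(shared_state_t state,
                                              bool master_is_local,
                                              bool resources_available)
{
    bool report_sr = (state == CSTATE_SR) && master_is_local;
    if (resources_available)
        return report_sr ? PR_SR_AFFIRM : PR_S_AFFIRM;
    return report_sr ? PR_SR_POSSIBLY_HIDDEN : PR_S_POSSIBLY_HIDDEN;
}
```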

VII. Exemplary Operation

With reference first to FIG. 6, there is depicted a high level logical flowchart of an exemplary method of servicing a processor (CPU) request in a data processing system in accordance with the present invention. As shown, the process begins at block 600, which represents a master 232 in an L2 cache 230 receiving a CPU request (e.g., a CPU data load request, a CPU data store request, a CPU load-and-reserve request, a CPU instruction load request, etc.) from an associated processor core 200 in its processing unit 104. In response to receipt of the CPU request, master 232 determines at block 602 whether or not the target memory block, which is identified within the CPU request by a target address, is held in L2 cache directory 302 in a coherency state that permits the CPU request to be serviced without issuing a bus operation on the interconnect fabric. For example, a CPU instruction fetch request or data load request can be serviced without issuing a bus operation on the interconnect fabric if L2 cache directory 302 indicates that the coherency state of the target memory block is any of the M, Me, Tx (e.g., T, Tn, Te or Ten), Sr or S states. A CPU data store request can be serviced without issuing a bus operation on the interconnect fabric if L2 cache directory 302 indicates that the coherency state of the target memory block is one of the M or Me states. If master 232 determines at block 602 that the CPU request can be serviced without issuing a bus operation on the interconnect fabric, master 232 accesses L2 cache array 300 to service the CPU request, as shown at block 624. For example, master 232 may obtain a requested memory block and supply the requested memory block to the requesting processor core 200 in response to a CPU data load request or instruction fetch request or may store data provided in a CPU data store request into L2 cache array 300. Following block 624, the process terminates at block 626.

Returning to block 602, if the target memory block is not held in L2 directory 302 in a coherency state that permits the CPU request to be serviced without issuing a bus operation on the interconnect fabric, a determination is also made at block 604 whether or not a castout of an existing cache line is required to accommodate the target memory block in L2 cache 230. In one embodiment, a castout operation is required at block 604 if a memory block is selected for eviction from the L2 cache 230 of the requesting processor in response to the CPU request and is marked in L2 directory 302 as being in any of the M, T, Te, Tn or Ig coherency states. In response to a determination at block 604 that a castout is required, a cache castout operation is performed, as indicated at block 606. Concurrently, the master 232 determines at block 610 a scope of a bus operation to be issued to service the CPU request. For example, in one embodiment, master 232 determines at block 610 whether to broadcast a bus operation as a local operation or a global operation.

In a first embodiment in which each bus operation is initially issued as a local operation and issued as a local operation only once, the determination depicted at block 610 can simply represent a determination by the master of whether or not the bus operation has previously been issued as a local bus operation. In a second alternative embodiment in which local bus operations can be retried, the determination depicted at block 610 can represent a determination by the master of whether or not the bus operation has previously been issued more than a threshold number of times. In a third alternative embodiment, the determination made at block 610 can be based upon a prediction by the master 232 of whether or not a local bus operation is likely to be successful in resolving the coherency of the target memory block without communication with processing nodes in other coherency domains. An exemplary implementation of this third alternative embodiment is described in greater detail below with reference to FIGS. 7-9.

In response to a determination at block 610 to issue a global bus operation rather than a local bus operation, the process proceeds from block 610 to block 620, which is described below. If, on the other hand, a determination is made at block 610 to issue a local bus operation, master 232 initiates a local bus operation on its local interconnect 114, as illustrated at block 612. The local bus operation is broadcast only within the local coherency domain (e.g., processing node 102) containing master 232. If master 232 receives a CR indicating “Success” (block 614), the process passes to block 623, which represents master 232 updating the predictor utilized to make the scope selection depicted at block 610. In addition, master 232 services the CPU request, as shown at block 624. Thereafter, the process ends at block 626.

Returning to block 614, if the CR for the local bus read operation does not indicate “Success”, master 232 makes a determination at block 616 whether or not the CR is a “Retry Global” CR that definitively indicates that the coherency protocol mandates the participation of one or more processing nodes outside the local coherency domain and that the bus operation should therefore be reissued as a global bus operation. If so, the process passes to block 620, which is described below. If, on the other hand, the CR is a “Retry” CR that does not definitively indicate that the bus operation cannot be serviced within the local coherency domain, the process returns from block 616 to block 610, which illustrates master 232 again determining whether or not to issue a local bus operation to service the CPU request. In this case, master 232 may employ in the determination any additional information provided by the CR. Following block 610, the process passes to either block 612, which is described above, or to block 620.

Block 620 depicts master 232 issuing a global bus operation to all processing nodes 102 in data processing system 100 in order to service the CPU request. If the CR of the global bus read operation does not indicate “Success” at block 622, master 232 reissues the global bus operation at block 620 until a CR indicating “Success” is received. If the CR of the global bus read operation indicates “Success”, the process proceeds to block 623 and following blocks, which have been described.
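The overall flow of FIG. 6 from block 610 onward can be sketched in C as shown below. The function names (predict_scope, issue_bus_operation, update_predictor, service_cpu_request) are placeholders standing in for the behavior of master 232 and scope prediction logic 250, not interfaces defined herein.

```c
#include <stdbool.h>

typedef enum { SCOPE_LOCAL, SCOPE_GLOBAL } scope_t;
typedef enum { CR_SUCCESS, CR_RETRY, CR_RETRY_GLOBAL } combined_response_t;

/* Hypothetical hooks; the real logic is described with FIGS. 6-9. */
extern scope_t             predict_scope(const void *cpu_request);
extern combined_response_t issue_bus_operation(const void *cpu_request, scope_t scope);
extern void                update_predictor(const void *cpu_request, scope_t winning_scope);
extern void                service_cpu_request(const void *cpu_request);

/* Blocks 610-626 of FIG. 6: try a local bus operation when predicted
 * useful, escalate to a global bus operation on "Retry Global" or when a
 * global scope is selected, and update the predictor once the operation
 * succeeds. */
static void issue_and_service(const void *cpu_request)
{
    for (;;) {
        scope_t scope = predict_scope(cpu_request);                         /* block 610 */
        if (scope == SCOPE_LOCAL) {
            combined_response_t cr =
                issue_bus_operation(cpu_request, SCOPE_LOCAL);              /* block 612 */
            if (cr == CR_SUCCESS) {                                         /* block 614 */
                update_predictor(cpu_request, SCOPE_LOCAL);                 /* block 623 */
                break;
            }
            if (cr == CR_RETRY)              /* block 616: re-predict and possibly retry locally */
                continue;
            /* CR_RETRY_GLOBAL falls through to the global path */
        }
        while (issue_bus_operation(cpu_request, SCOPE_GLOBAL) != CR_SUCCESS)
            ;                                /* blocks 620-622: reissue until "Success" */
        update_predictor(cpu_request, SCOPE_GLOBAL);                        /* block 623 */
        break;
    }
    service_cpu_request(cpu_request);                                       /* block 624 */
}
```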

Thus, assuming affinity between processes and their data within the same coherency domain, CPU requests can frequently be serviced utilizing broadcast communication limited in scope to the coherency domain of the requesting master or of other restricted scope less than a full global scope. The combination of data delivery domains as hereinbefore described and coherency domains thus improves not only data access latency, but also reduces traffic on the system interconnect (and other local interconnects) by limiting the scope of broadcast communication.

VIII. Scope Prediction

With reference now to FIG. 7, there is illustrated a block diagram representation of an exemplary embodiment of scope prediction logic 250 within the master 232 of an L2 cache 230 in accordance with one embodiment of the present invention. As noted above, scope prediction logic 250 may be employed to perform the scope selection illustrated at blocks 610, 614 and 616 of FIG. 6.

In a preferred embodiment, scope prediction logic 250 includes unillustrated logic for generating static predictions of the scopes of broadcast bus operations. In one embodiment, scope prediction logic 250 generates the static prediction based upon the transaction type (TTYPE) of the bus operation (e.g., read, RWITM, DClaim, DCBZ, write, partial write, etc.) to be issued and the current coherency state of the target memory block of the bus operation in the local L2 cache directory 302.

As further illustrated in FIG. 7, scope prediction logic 250 may advantageously include history-based prediction logic 700, which generates scope predictions for bus operations based upon the actual scopes of previous bus operations. Because different classes of bus operations tend to exhibit different behaviors, history-based prediction logic 700 separately records historical information for different classes of bus operations within the various predictors 704a-704n of a predictor array 702. In general, if the operation classes are properly constructed, the past behavior of bus operations within each class will serve as an accurate predictor of the scope of future bus operations within the same class.

In one embodiment, each predictor 704 is implemented as a counter. Assuming good software affinity, a large majority of bus operations in each operation class should be able to be serviced utilizing only local bus operations. Accordingly, in one embodiment, each counter 704 is initialized to an initial value representing a global operation scope, is updated by update logic 714 for each consecutive bus operation in the associated class that is serviced entirely within the local coherency domain until a threshold (e.g., 3) is reached, and thereafter indicates a local operation scope for bus operations in the associated class until a bus operation in the associated class is serviced by a participant outside the local processing node 102. In that case, the predictor 704 is reset by update logic 714 to its initial value. Thus, in this embodiment, predictors 704 saturate slowly to the prediction of local scope for bus operations, but react quickly to the infrequent occurrence of global bus operations. In other embodiments, predictors 704 may, of course, simply decrement in response to a global bus operation so that predictors 704 saturate to global and local scope predictions at the same rate.
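A minimal C sketch of such a counter-based predictor 704 and its update logic 714 follows; the threshold value and all identifiers are illustrative assumptions rather than a definitive implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOCAL_THRESHOLD 3   /* illustrative saturation threshold */

/* One predictor 704: a counter that starts out predicting a global scope,
 * saturates slowly toward a local-scope prediction, and snaps back to its
 * initial (global) value as soon as an operation in the class is serviced
 * by a participant outside the local processing node. */
typedef struct {
    uint8_t count;          /* consecutive operations serviced locally */
} scope_predictor_t;

static void predictor_init(scope_predictor_t *p)
{
    p->count = 0;                       /* initial value: predict global */
}

static bool predictor_predict_local(const scope_predictor_t *p)
{
    return p->count >= LOCAL_THRESHOLD; /* predict local only after saturation */
}

/* Update logic 714: called once the actual scope required by the
 * operation is known from its combined response. */
static void predictor_update(scope_predictor_t *p, bool serviced_locally)
{
    if (!serviced_locally)
        p->count = 0;                   /* fast reset toward global */
    else if (p->count < LOCAL_THRESHOLD)
        p->count++;                     /* slow saturation toward local */
}
```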

Although good software affinity is typical, in some cases, particular memory blocks or memory pages may exhibit weaker affinity and therefore require a large proportion of global bus operations. Accordingly, history-based prediction logic 700 may optionally include mode field 708, which may be set by hardware (e.g., master 232) or software (e.g., system firmware) to cause one or more of predictors 704a-704n to operate with, or be interpreted as having, a reversed bias. With a reversed bias, the initial value of a predictor 704 represents a prediction of local operation scope, the predictor 704 saturates to an indication of global operation scope after a threshold number of operations (e.g., 3) are resolved outside the local coherency domain, and the predictor 704 is reset by update logic 714 to a prediction of local scope upon an operation in the associated class being serviced within the local coherency domain.

As will be appreciated, the classes corresponding to predictors 704a-704n can be constructed utilizing any of a large number of sets of criteria. In one embodiment, these criteria form a set of read inputs 720 and a set of update inputs 730 including a thread identifier (TID), the transaction type (TTYPE) of the bus operation (e.g., read, RWITM, DClaim, DCBZ, write, partial write, etc.), an instruction/data (I/D) indication indicating whether the contents of the target memory block are instructions or data, an atomic indication indicating whether the requested data access relates to an atomic memory update (e.g., whether the CPU request was triggered by the execution of a load-and-reserve or store-conditional instruction by the source processor core 200), and an LPC indication.

The TID, which is preferably received from a processor core 200 as part of, or in conjunction with, a CPU request, uniquely identifies the processor thread that issued the CPU request to be serviced. In an embodiment in which multiple processor cores 200 share an L2 cache 230, the TID preferably includes a processor core identifier so that threads of the different processor cores 200 can be disambiguated. For example, for embodiments of processing units 104 including two processor cores 200 that each support two simultaneous hardware threads, the TID may be implemented with 2 bits: 1 bit to identify the source processor core 200 and 1 bit to identify which thread of the processor core 200 issued the CPU request.

The I/D indication is also preferably received by L2 cache 230 from a processor core 200 as part of, or in conjunction with, a CPU request. The I/D indication may be generated by an L1 cache 226 based upon whether the CPU request arose from an instruction fetch miss or a data access.

The LPC indication provides an indication of whether or not the LPC for the target memory block resides within the local coherency domain containing the L2 cache 230. The LPC indication may be generated, for example, by BAR register 252 of master 232 in a conventional manner.

From the set of read inputs 720 and update inputs 730, operation classes are constructed based at least partially upon a binary expansion of an index including at least a TTYPE_group field, a TID field, and an LPC field.

The TTYPE_group field identifies a particular group of TTYPEs into which a bus operation falls. In one embodiment, a larger number of TTYPEs of bus operations are represented by a fewer number of TTYPE_groups. The TTYPE_groups may be constructed based upon not only bus operation TTYPEs, but also other information such as the I/D and atomic indications. For example, in one embodiment, the various possible bus operations are represented by four TTYPE_groups (instruction fetch, data fetch, load-and-reserve, and store), which can be advantageously encoded as a 2-bit TTYPE_group field.

As shown in FIG. 7, history-based prediction logic 700 includes index generation logic 712 for generating read and update indexes utilized to selectively access the predictors 704 within predictor array 702 corresponding to particular operation classes. In an embodiment implementing the four TTYPE_groups defined above, index generation logic 712 generates the 2-bit TTYPE_group field of a read or update index from the bus operation TTYPE and the I/D and atomic indications in accordance with Table IV below (a dash (‘-’) represents a “don't care”). Index generation logic 712 then forms the complete index by concatenating the TTYPE_group field with the TID and LPC indications.

TABLE IV

TTYPE | I/D | Atomic | TTYPE_group
READ | I | No | instruction fetch
READ | D | No | data fetch
READ | D | Yes | load-and-reserve
RWITM (Read-With-Intent-To-Modify) | D | - | store
DCLAIM (Data Claim) | D | - | store
DCBZ (Data Cache Block Zero) | D | - | store

Assuming that index generation logic 712 generates 5-bit indexes including a 2-bit TTYPE_group field, a 2-bit TID field and a 1-bit LPC field, predictor array 702 may support history-based scope prediction for 32 (i.e., 2⁵) operation classes each having a respective predictor 704. The update index generated by index generation logic 712 can be employed by a decoder 706 to update the value of a particular predictor 704, and the read index can be used by an N-to-1 multiplexer 710 to output the scope prediction of a particular predictor 704. Of course, additional classes and index bits may be implemented based upon other class criteria, for example, bit subranges of the target memory address, etc.
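The mapping of Table IV and the 5-bit indexing of predictor array 702 can be sketched in C as follows; the enumerations, bit layout and function names are illustrative assumptions rather than the actual implementation of index generation logic 712.

```c
#include <stdint.h>

/* TTYPE_group encoding of Table IV (2 bits). */
typedef enum {
    GROUP_IFETCH = 0,   /* READ, I-side                   */
    GROUP_DFETCH = 1,   /* READ, D-side, non-atomic       */
    GROUP_LARX   = 2,   /* READ, D-side, load-and-reserve */
    GROUP_STORE  = 3    /* RWITM, DCLAIM, DCBZ            */
} ttype_group_t;

typedef enum { TT_READ, TT_RWITM, TT_DCLAIM, TT_DCBZ } ttype_t;

/* Map a bus-operation TTYPE plus the I/D and atomic indications onto a
 * TTYPE_group per Table IV. */
static ttype_group_t ttype_group(ttype_t ttype, int is_instruction, int is_atomic)
{
    if (ttype != TT_READ)
        return GROUP_STORE;           /* RWITM, DCLAIM, DCBZ: I/D and atomic are "don't care" */
    if (is_instruction)
        return GROUP_IFETCH;
    return is_atomic ? GROUP_LARX : GROUP_DFETCH;
}

/* 5-bit index = 2-bit TTYPE_group, 2-bit TID, 1-bit LPC indication,
 * selecting one of 32 predictors 704 within predictor array 702. */
static unsigned predictor_index(ttype_group_t grp, unsigned tid, unsigned lpc_local)
{
    return ((unsigned)grp << 3) | ((tid & 0x3u) << 1) | (lpc_local & 0x1u);
}

/* Predictor array 702: one saturating counter per operation class. */
static uint8_t predictor_array[32];
```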

It should further be noted that the number of predictors 704 may, but need not, double for each additional bit included within the read and update indexes. Instead, a single counter 704 may be established in association with a particular criterion represented by a dominant bit in the indexes. Decoder 706 and multiplexer 710 may further be implemented to access that corresponding counter 704 when the dominant bit is asserted, irrespective of the values of the other index bits. Such an implementation would be advantageous and desirable in cases in which a particular class criterion is likely to be more determinative of actual scope outcomes than the other index bits.
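
Purely as an illustration of the dominant-bit arrangement just described (the chosen bit position, slot numbering, and counter count are assumptions), a software model of the combined behavior of decoder 706 and multiplexer 710 might collapse all index values with the dominant bit asserted onto a single shared counter:

    #include <stdio.h>

    /* Hypothetical layout: bit 0 of the index is the dominant criterion with
     * its own single counter; the remaining bits select among ordinary
     * per-class counters only when the dominant bit is clear. */
    #define DOMINANT_BIT   0x01u
    #define DOMINANT_SLOT  16u   /* one shared counter for every dominant-bit class */

    static unsigned predictor_slot(unsigned index)
    {
        if (index & DOMINANT_BIT)
            return DOMINANT_SLOT;   /* other index bits are ignored */
        return index >> 1;          /* 16 ordinary per-class counters (slots 0-15) */
    }

    int main(void)
    {
        printf("slot for index 0x1f: %u\n", predictor_slot(0x1fu)); /* 16 */
        printf("slot for index 0x1e: %u\n", predictor_slot(0x1eu)); /* 15 */
        return 0;
    }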

Referring now to FIG. 8, there is depicted an exemplary method of scope prediction performed by scope prediction logic 250 in accordance with a preferred embodiment of the present invention. As illustrated, the process begins at block 800, for example, in response to receipt by scope prediction logic 250 of a TTYPE of a bus operation to be issued, the local coherency state of the target address of the bus operation in the L2 cache directory 302, and a set of read inputs 720, at block 610 of FIG. 6. The process then proceeds to block 802, which illustrates scope prediction logic 250 determining if the TTYPE input indicates that the bus operation to be issued by master 232 is a bus read, bus RWITM or bus DCBZ operation. If not, the process proceeds to block 810, which is described below. If, on the other hand, the TTYPE input indicates that the bus operation to be issued is a bus read, bus RWITM or bus DCBZ operation, unillustrated logic within scope prediction logic 250 preferentially predicts the scope of the bus operation based upon the local coherency state of the target memory block, if possible.

That is, if the coherency state input indicates that the coherency state of the target address with respect to the local L2 cache directory 302 is In, scope prediction logic 250 predicts a local scope for the bus operation, as shown at blocks 804 and 822. Alternatively, if the coherency state input indicates that the coherency state of the target address with respect to the local L2 cache directory 302 is Ig, scope prediction logic 250 predicts a global scope for the bus operation, as shown at blocks 808 and 814. Alternatively, if the target address is not associated with an In or Ig coherency state in the L2 cache directory 302, scope prediction logic 250 preferably predicts the scope of the bus operation utilizing history-based prediction logic 700, as depicted at block 820 and described in greater detail below with reference to FIG. 9.

Referring now to block 810, if scope prediction logic 250 determines that the bus operation to be issued is a bus write or bus castout operation, unillustrated logic within scope prediction logic 250 preferably predicts the scope of the bus operation based upon the LPC input, as illustrated at block 812. Thus, scope prediction logic 250 predicts a global scope for the bus operation (block 814) if the LPC input indicates that the LPC for the target address is not within the local processing node 104, and predicts a local scope for the bus operation (block 822) otherwise.

Referring again to block 810, if the TTYPE input indicates that the bus operation is another type of operation, for example, a bus DClaim operation, scope prediction logic 250 preferably predicts a scope for the bus operation utilizing history-based prediction logic 700, as illustrated at block 820. In such cases, scope prediction logic 250 provides a local scope prediction (block 822) if history-based prediction logic 700 indicates a local scope and provides a global scope prediction (block 814) if history-based prediction logic 700 indicates a global scope.
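
As a rough software sketch of the overall decision flow of FIG. 8 (the type names, argument list, and the stand-in history_predicts_local value are assumptions made for illustration only, not the logic of the embodiment itself), scope prediction logic 250 might be modeled as follows:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for signals described in the text. */
    enum scope     { SCOPE_LOCAL, SCOPE_GLOBAL };
    enum bus_ttype { BUS_READ, BUS_RWITM, BUS_DCBZ, BUS_WRITE, BUS_CASTOUT, BUS_DCLAIM };
    enum coh_state { STATE_IN, STATE_IG, STATE_OTHER };

    static enum scope predict_scope(enum bus_ttype t, enum coh_state dir_state,
                                    bool lpc_is_local, bool history_predicts_local)
    {
        if (t == BUS_READ || t == BUS_RWITM || t == BUS_DCBZ) {
            if (dir_state == STATE_IN) return SCOPE_LOCAL;    /* blocks 804, 822 */
            if (dir_state == STATE_IG) return SCOPE_GLOBAL;   /* blocks 808, 814 */
            /* otherwise fall through to history-based prediction (block 820) */
        } else if (t == BUS_WRITE || t == BUS_CASTOUT) {
            return lpc_is_local ? SCOPE_LOCAL : SCOPE_GLOBAL; /* blocks 812, 814, 822 */
        }
        return history_predicts_local ? SCOPE_LOCAL : SCOPE_GLOBAL; /* block 820 */
    }

    int main(void)
    {
        enum scope s = predict_scope(BUS_DCLAIM, STATE_OTHER, true, true);
        printf("%s\n", s == SCOPE_LOCAL ? "local" : "global");
        return 0;
    }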

With reference now to FIG. 9, there is illustrated a more detailed logical flowchart of an exemplary process of history-based scope prediction in accordance with the present invention. In the embodiment of FIG. 7, the illustrated process is implemented by history-based prediction logic 700.

As depicted, the process begins at block 900 and thereafter proceeds to blocks 902 and 904, which respectively depict the initialization of mode field 708 and predictors 704, for example, as part of hardware power-on reset operations and/or firmware initialization procedures. Thereafter, the process trifurcates and proceeds in parallel to each of blocks 906, 920 and 930.

Block 906 represents history-based prediction logic 700 iterating until a set of read inputs 720 associated with a prospective bus operation to be issued is received. When a set of read inputs 720 is received, index generation logic 712 generates a read index, as depicted at block 908. In response to receipt of the read index, multiplexer 710 selects and outputs from predictor array 702 the value of a particular predictor 704 corresponding to the operation class identified by the read index, as shown at block 910. If scope prediction logic 250 has selected history-based prediction for the current bus operation, for example, in accordance with the method of FIG. 8, scope prediction logic 250 determines the scope prediction by reference to the predictor value and the value of mode field 708, if present. For example, assuming the mode field 708, if present, is set so that the relevant predictor 704 has a default bias, scope prediction logic 250 predicts a global scope if the predictor value is below the saturating threshold and predicts a local scope if the predictor value is at or above the saturating threshold. The prediction is reversed if the mode field 708 is set so that the relevant predictor 704 has a reverse bias. Following block 910, the process returns to block 906.
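
A minimal sketch of the read-path comparison just described, assuming a particular saturating threshold and a per-predictor reverse-bias flag (both of which are illustrative assumptions rather than values taken from the embodiment), follows:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SATURATION_THRESHOLD 3u   /* assumed threshold for this sketch */

    /* Returns true for a local-scope prediction: a value at or above the
     * threshold predicts local, a value below it predicts global, and the
     * sense is reversed for a predictor marked reverse-biased in the mode
     * field. */
    static bool predict_local_scope(uint8_t predictor_value, bool reverse_bias)
    {
        bool local = (predictor_value >= SATURATION_THRESHOLD);
        return reverse_bias ? !local : local;
    }

    int main(void)
    {
        printf("%d\n", predict_local_scope(4, false)); /* 1: local  */
        printf("%d\n", predict_local_scope(4, true));  /* 0: global */
        return 0;
    }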

Referring now to block 920, history-based prediction logic 700 iterates at block 920 until a set of update inputs 730 is received from master 232 describing a bus operation for which a combined response indicating “Success” has been received on the local interconnect 114. (Master 232 maintains state for each bus operation until it completes successfully.) In response to receipt of the set of update inputs 730, index generation logic 712 generates an update index for the bus operation for which the combined response was received, as indicated at block 922. Next, as illustrated at block 924, update logic 714 utilizes the combined response that was received for the bus operation to generate an update for a predictor 704, which update is applied to the predictor 704 selected by decoder 706 in response to receipt of the update index from index generation logic 712. In particular, if the “Success” CR indicates that the bus operation was serviced by a snooper 122, 236 in the local coherency domain, update logic 714 outputs a counter increment signal. If the “Success” CR indicates that the bus operation was serviced by a snooper 122, 236 outside of the local coherency domain, update logic 714 outputs a counter reset signal. The interpretation of these update signals is reversed if mode field 708 indicates that the counter 704 to which the update signal is to be applied is operating with a reversed bias. Following block 924, the process returns to block 920.
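
Continuing the same illustrative sketch (the counter ceiling and argument names are assumptions), the update behavior just described might be modeled as a saturating increment when the operation was serviced within the local coherency domain and a reset otherwise, with the sense reversed for a reverse-biased predictor:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define COUNTER_MAX 7u   /* assumed saturation ceiling for this sketch */

    static void update_predictor(uint8_t *predictor, bool serviced_locally,
                                 bool reverse_bias)
    {
        bool count_up = reverse_bias ? !serviced_locally : serviced_locally;
        if (count_up) {
            if (*predictor < COUNTER_MAX)
                (*predictor)++;      /* saturating increment */
        } else {
            *predictor = 0;          /* reset toward the opposite prediction */
        }
    }

    int main(void)
    {
        uint8_t p = 0;
        update_predictor(&p, true, false);   /* serviced locally: count up */
        update_predictor(&p, false, false);  /* serviced remotely: reset   */
        printf("%u\n", p);                   /* prints 0                   */
        return 0;
    }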

With reference now to block 930, history-based prediction logic 700 iterates at block 930 until an update to mode field 708 is received. In response to receipt of an update to mode field 708, history-based prediction logic 700 updates mode field 708 to correctly reflect which predictors 704 are operating with a forward bias and which predictors 704 are operating with a reversed bias, as indicated at block 902. In addition, the predictor(s) 704 affected by the update to mode field 708 are initialized at block 904. Thereafter, the process returns to block 930.
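
As one possible, purely illustrative model of this mode-update path (the per-class bias bitmap, the 32-entry predictor array, and the function name are all assumptions), changing the bias for an operation class would update the mode field and re-initialize only the affected predictor:

    #include <stdint.h>
    #include <stdio.h>

    /* Updates an assumed per-class bias bitmap and re-initializes the
     * predictor whose bias changed, mirroring the return through blocks
     * 902 and 904 described above. */
    static void apply_mode_update(uint32_t *mode_field, uint8_t predictors[32],
                                  unsigned class_index, int reverse_bias)
    {
        if (reverse_bias)
            *mode_field |= (1u << class_index);
        else
            *mode_field &= ~(1u << class_index);
        predictors[class_index] = 0;   /* re-initialize the affected predictor */
    }

    int main(void)
    {
        uint32_t mode = 0;
        uint8_t preds[32] = {0};
        apply_mode_update(&mode, preds, 13, 1);
        printf("mode = 0x%08x\n", mode);   /* bit 13 set */
        return 0;
    }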

As has been described, the present invention provides an improved method and system for selecting or predicting a scope of a broadcast operation transmitted on an interconnect of a data processing system. In accordance with the present invention, the scope of at least some broadcast operations is predicted by reference to the actual scopes of previous successful broadcast operations. History-based prediction may be enhanced by maintaining separate historical indications of operation scope for different classes of operations.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

1. A cache memory for a data processing system including an interconnect fabric and at least first and second coherency domains each containing at least one processing unit, said cache memory comprising: a data array; a cache directory of contents of said data array; and a cache controller including scope prediction logic, said scope prediction logic further including: a plurality of predictors each indicating a historical scope of broadcast on the interconnect fabric for one or more previous operations of a respective one of a plurality of operation classes; index generation logic that, responsive to receipt of a set of read inputs characterizing a prospective bus operation to be issued, generates a read index to select a scope prediction indication by one of said plurality of predictors for use in transmission of said prospective bus operation; and update logic that, responsive to receipt of a combined response to a successfully completed bus operation, generates an update to one of said plurality of predictors in accordance with a location of a servicing snooper in said data processing system.
2. A data processing system, comprising: at least first and second coherency domains each containing at least one processing unit including a cache memory; an interconnect fabric coupling said first and second coherency domains; and scope prediction logic within the first coherency domain, said scope prediction logic including a predictor that indicates a historical scope of broadcast for one or more previous operations transmitted on the interconnect fabric, wherein said scope prediction logic predictively selects a scope of broadcast of a subsequent operation by reference to said predictor if said operation is of a first predetermined operation type and predictively selects a scope of broadcast of said operation by reference to a coherency state of a target memory address of said operation if said operation is of a second predetermined operation type.

3. The data processing system of claim 2, wherein said scope prediction logic predictively selects a first scope of broadcast including both said first and second coherency domains in response to a first setting of said predictor and selects a second scope of broadcast including said first coherency domain and excluding said second coherency domain in response to a second setting of said predictor.

4. The data processing system of claim 2, wherein said cache memory of said first coherency domain includes said scope prediction logic.

5. The data processing system of claim 2, wherein said scope prediction logic includes a plurality of predictors that each indicates a historical scope of broadcast of operations in a respective one of a plurality of operation classes.

6. The data processing system of claim 2, wherein said predictor comprises a saturating counter that saturates toward a prediction of a narrower broadcast scope.

7. The data processing system of claim 6, wherein said scope prediction logic includes a mode field, and wherein said scope prediction logic reverses a prediction indicated by said saturating counter in response to a setting of said mode field.

8. A cache memory for a data processing system including an interconnect fabric and at least first and second coherency domains each containing at least one processing unit, said cache memory comprising: a data array; a cache directory of contents of said data array; and a cache controller including scope prediction logic, wherein said scope prediction logic includes a predictor that indicates a historical scope of broadcast for one or more previous operations transmitted by said cache memory on the interconnect fabric, wherein said scope prediction logic predictively selects a scope of broadcast of a subsequent operation by reference to said predictor if said operation is of a first predetermined operation type and predictively selects a scope of broadcast of said operation by reference to a coherency state of a target memory address of said operation recorded in said cache directory if said operation is of a second predetermined operation type.

9. The cache memory of claim 8, wherein said scope prediction logic predictively selects a first scope of broadcast including both said first and second coherency domains in response to a first setting of said predictor and selects a second scope of broadcast including said first coherency domain and excluding said second coherency domain in response to a second setting of said predictor.

10. The cache memory of claim 8, wherein said scope prediction logic includes a plurality of predictors that each indicates a historical scope of broadcast of operations in a respective one of a plurality of operation classes.

11. The cache memory of claim 8, wherein said predictor comprises a saturating counter that saturates toward a prediction of a narrower broadcast scope.

12. A processing unit, comprising: a cache memory in accordance with claim 8; and at least one processor core coupled to said cache memory.